Our team has recently been pushing on APM, and we need to improve our product's deep learning model along two dimensions: inference speed and on-disk size.

We currently run the model in the TensorFlow Lite format. We wanted to optimize it with XLA, the experimental compiler project that TensorFlow announced in 2017. According to the conclusions of the official examples, a model optimized with XLA/AOT can run 10%~200% faster than the same model in .pb format (with occasional exceptions) and take roughly 4x less space.

This post walks through the complete optimization process step by step, together with the pitfalls we hit and how we worked around them.

Step -1: Overview of compiling a model to AOT (ahead-of-time) code with XLA

  1. Compile tfcompile

  2. Freeze the model

  3. Write graph.config.pbtxt

  4. Write the bazel BUILD script

  5. Compile the platform-specific binaries (.o, .h)

  6. Write code that calls the AOT model

  7. Write the BUILD file

  8. Compile the final platform-specific artifact (.so)

    Environment

    • Ubuntu 18.04
    • Bazel 0.24
    • JDK 8
    • Android NDK
    • Android SDK

    Directory layout

    //tensorflow/compiler/aot/
    │ aot_only_var_handle_op.cc
    │ benchmark.cc
    │ benchmark.h
    │ benchmark_main.template
    │ benchmark_test.cc
    │ BUILD
    │ codegen.cc
    │ codegen.h
    │ codegen_test.cc
    │ codegen_test_h.golden
    │ codegen_test_o.golden
    │ compile.cc
    │ compile.h
    │ embedded_protocol_buffers.cc
    │ embedded_protocol_buffers.h
    │ flags.cc
    │ flags.h
    │ test.cc
    │ test_graph_tfadd.config.pbtxt
    │ test_graph_tfadd.pbtxt
    │ test_graph_tfunknownop.config.pbtxt
    │ test_graph_tfunknownop.pbtxt
    │ test_graph_tfunknownop2.config.pbtxt
    │ test_graph_tfunknownop3.config.pbtxt
    │ tfcompile.bzl
    │ tfcompile_main.cc
    ├─custom
    │ │ BUILD
    │ │ com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier.cc
    │ │ com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier.h
    │ │ custom_interface.config.pbtxt
    │ │ custom_interface_lib.h
    │ │ custom_interface_tfcompile_function.o
    │ │ custom_interface_tfcompile_metadata.o
    │ │ debug.cc
    │ │ debug.h
    │ │ figure-65.png
    │ │ figure-66.jpg
    │ │ frozen_custom_010.pb
    │ │ input_image.py
    │ │ libcustom_interface.a
    │ │ libcustom_interface.pic.a
    │ │ libcustom_interface.so
    │ │ lib_custom_interface.so
    │ │ log.h
    │ │ log_stream.h
    │ │ out.h
    │ │ out_helper.o
    │ │ out_model.o
    │ │ predict_model.py
    │ │ Screenshot_67.jpg
    │ │ Screenshot_68.png
    │ │ tfcompile_h_o.py
    │ │ __init__.py
    │ │
    │ ├─arm64-v8a
    │ │ libcustom_interface.so
    │ │ lib_custom_interface.so
    │ │
    │ └─armeabi-v7a
    │ libcustom_interface.so
    │ lib_custom_interface.so
    └─tests
    BUILD
    make_test_graphs.py
    test_graph_tfadd.config.pbtxt
    test_graph_tfadd_with_ckpt.config.pbtxt
    test_graph_tfassert_eq.config.pbtxt
    test_graph_tfcond.config.pbtxt
    test_graph_tffunction.config.pbtxt
    test_graph_tfgather.config.pbtxt
    test_graph_tfmatmul.config.pbtxt
    test_graph_tfmatmulandadd.config.pbtxt
    test_graph_tfsplits.config.pbtxt
    test_graph_tftop_k.config.pbtxt
    test_graph_tfvariable.config.pbtxt
    test_graph_tfvariable_sequential_updates.config.pbtxt
    tfcompile_test.cc

Step 0: Compile tfcompile

  • Building tfcompile really just means building one part of the TensorFlow source tree, but that part still requires the dependencies of the whole project.

    1. Download the source

      git clone --recurse-submodules https://github.com/tensorflow/tensorflow

      The --recurse-submodules flag is required; it fetches the protobuf library that TensorFlow depends on.

    2. Configure TensorFlow

      cd ~/tensorflow
      ./configure

      Pay attention to a few of the configure prompts:

      You have bazel 0.25.0 installed.
      Please specify the location of python. [Default is C:\ProgramData\Anaconda3\python.exe]:
      Found possible Python library paths:
      C:\ProgramData\Anaconda3\lib\site-packages
      Please input the desired Python library path to use. Default is [C:\ProgramData\Anaconda3\lib\site-packages]
      Do you wish to build TensorFlow with XLA JIT support? [y/N]: Y
      XLA JIT support will be enabled for TensorFlow.
      Do you wish to build TensorFlow with ROCm support? [y/N]: N
      No ROCm support will be enabled for TensorFlow.
      Do you wish to build TensorFlow with CUDA support? [y/N]: N
      No CUDA support will be enabled for TensorFlow.
      Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is
      /arch:AVX]:
      Would you like to override eigen strong inline for some C++ compilation to reduce the compilation time? [Y/n]: N
      Not overriding eigen strong inline, some compilations could take more than 20 mins.
      Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .ba
      zelrc for more details.
      --config=mkl # Build with MKL support.
      --config=monolithic # Config for mostly static monolithic build.
      --config=gdr # Build with GDR support.
      --config=verbs # Build with libverbs support.
      --config=ngraph # Build with Intel nGraph support.
      --config=numa # Build with NUMA support.
      --config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
      --config=v2 # Build TensorFlow 2.x instead of 1.x.
      Preconfigured Bazel build configs to DISABLE default on features:
      --config=noaws # Disable AWS S3 filesystem support.
      --config=nogcp # Disable GCP support.
      --config=nohdfs # Disable HDFS support.
      --config=noignite # Disable Apache Ignite support.
      --config=nokafka # Disable Apache Kafka support.
      --config=nonccl # Disable NVIDIA NCCL support.
      Configuration finished

      The most important prompt is "Do you wish to build TensorFlow with XLA JIT support? [y/N]". Answer Y here; this is the key to using XLA.

    3. Build tfcompile

      The Bazel entry point for tfcompile is //tensorflow/compiler/aot:tfcompile.

      Run the build command:

      bazel build //tensorflow/compiler/aot:tfcompile

    Pitfalls we hit in this step:

    • Different targets are compiled concurrently, so "package not found" errors come up fairly often while building the source; simply re-running the same build command a few times usually gets past them.
    • The build first downloads its various dependencies. Occasionally a dependency has published a new version upstream while the SHA256 it is verified against is hard-coded in the download script, so the check fails; use the error log to locate the download script and update the corresponding SHA256 value.
    • Bazel caches downloaded dependencies, so you do not have to worry about losing them after a failed build (as long as you do not run bazel clean).
    • If tfcompile refuses to build no matter which network you are on, try bazel clean and re-run the command, or run bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package from the repo root to download and build all dependencies first; once that succeeds, the tfcompile build goes much more smoothly.

Step 1: Freeze the model

In this step the graph and the checkpoints are frozen into a .pb model; this is the raw material from which the binaries are generated later.

This step is easy; there are no pitfalls here ;)
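
For reference, here is a minimal sketch (TensorFlow 1.x) of freezing a graph and checkpoint with the freeze_graph tool. The graph and checkpoint paths are placeholders, and the output node name is the fetch node used in the next step; adapt them to your own model.

    from tensorflow.python.tools import freeze_graph

    # Fold the checkpoint weights into the graph and write a single frozen .pb.
    # All paths are placeholders; the output node matches the fetch node in
    # graph.config.pbtxt from the next step.
    freeze_graph.freeze_graph(
        input_graph="graph.pbtxt",               # GraphDef exported during training
        input_saver="",
        input_binary=False,                      # graph.pbtxt is in text format
        input_checkpoint="model.ckpt-010",       # checkpoint holding the trained weights
        output_node_names="MobilenetV2/Predictions/Reshape_1",
        restore_op_name="save/restore_all",
        filename_tensor_name="save/Const:0",
        output_graph="frozen_custom_010.pb",     # the frozen model fed to tfcompile
        clear_devices=True,
        initializer_nodes="")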

Step 2: graph.config.pbtxt

This step produces a description of the graph: the name of its input node, the shape of the input tensor, the name of its output node, and so on.

There are two ways to produce this description:

  1. Generate it automatically with tooling from the source tree (requires writing a bit of code)

    The tf2xla_pb2.py module in the source tree can be used to build the description programmatically (a hedged sketch is given at the end of this step).

    Here we introduce another way that is more intuitive and quicker:

  2. Determine the inputs and outputs with a visualization tool

    NETRON

    After importing the model into this tool you can expand the description of every node in the graph, read off the input and output node descriptions, and write the corresponding config file:

    # Each feed is a positional input argument for the generated function. The order
    # of each entry matches the order of each input argument. Here "input" is the
    # name of the placeholder node defined in the graph.
    feed {
      id { node_name: "input" }
      shape {
        dim { size: 1 }
        dim { size: 160 }
        dim { size: 160 }
        dim { size: 3 }
      }
    }
    # Each fetch is a positional output argument for the generated function. The order
    # of each entry matches the order of each output argument. Here
    # "MobilenetV2/Predictions/Reshape_1" refers to the output node of the graph.
    fetch {
      id { node_name: "MobilenetV2/Predictions/Reshape_1" }
    }

Save this file as graph.config.pbtxt.

This step is complete.

WARNING: comments in this description file may only use # as the marker; anything else will fail to compile, and the error will not point to the offending location.
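
For completeness, here is a hedged sketch of option 1 above: generating the config programmatically. It assumes the tf2xla_pb2 module generated from tensorflow/compiler/tf2xla/tf2xla.proto is importable from your checkout (you may need to generate it with protoc first); the node names and shape are the same ones used in the hand-written file.

    from google.protobuf import text_format
    from tensorflow.compiler.tf2xla import tf2xla_pb2  # generated from tf2xla.proto

    config = tf2xla_pb2.Config()

    # One feed per model input: the node name plus the static input shape.
    feed = config.feed.add()
    feed.id.node_name = "input"
    for size in (1, 160, 160, 3):
        feed.shape.dim.add().size = size

    # One fetch per model output: the node name is enough.
    fetch = config.fetch.add()
    fetch.id.node_name = "MobilenetV2/Predictions/Reshape_1"

    with open("graph.config.pbtxt", "w") as f:
        f.write(text_format.MessageToString(config))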

Step 3: Write the bazel BUILD script

In this step we write the build script. The configuration is short, but a few details deserve attention:

  • We recommend creating your own folder under //tensorflow/compiler/aot and putting the artifacts generated above into it; everything below assumes we are working in //tensorflow/compiler/aot/custom.

  • Create a BUILD file in the custom directory:

    load("//tensorflow/compiler/aot:tfcompile.bzl", "tf_library")
    tf_library(
    name = "custom_interface",
    cpp_class = "Classifier",
    graph = "frozen_custom_010.pb",
    config = "graph.config.pbtxt",
    )

    What the attributes mean:

    • name: the name of the artifacts the build will generate.

    • cpp_class: the name given to the class in the generated C++ header (.h); you can prepend namespaces, such as foo::bar::Classifier.

    • graph: the frozen .pb graph produced in the earlier step.

    • config: the graph description file produced in the previous step.

      This step is complete.

Step 4: Compile the platform-specific binaries (.o, .h)

Run the command: bazel build --verbose_failures //tensorflow/compiler/aot/custom:custom_interface

Here custom_interface corresponds to the name defined in the previous step.

Since tfcompile has already been built and this step only reuses part of that output, there are no real pitfalls.

There is also a second way to produce the artifacts of this step: locate the tfcompile executable under the bazel-bin directory created when tfcompile was built,

then invoke it directly: tfcompile --graph=frozen_custom_010.pb --config=graph.config.pbtxt --cpp_class="Classifier"

Here is the full tfcompile usage text; it includes the flags for targeting a specific platform ABI:

tfcompile performs ahead-of-time compilation of a TensorFlow graph,
resulting in an object file compiled for your target architecture, and a
header file that gives access to the functionality in the object file.
A typical invocation looks like this:
$ tfcompile --graph=mygraph.pb --config=myfile.pbtxt --cpp_class="mynamespace::MyComputation"
usage: ./tfcompile
Flags:
--graph="" string Input GraphDef file. If the file ends in '.pbtxt' it is expected to be in the human-readable proto text format, otherwise it is expected to be in the proto binary format.
--config="" string Input file containing Config proto. If the file ends in '.pbtxt' it is expected to be in the human-readable proto text format, otherwise it is expected to be in the proto binary format.
--dump_fetch_nodes=false bool If set, only flags related to fetches are processed, and the resulting fetch nodes will be dumped to stdout in a comma-separated list. Typically used to format arguments for other tools, e.g. freeze_graph.
--target_triple="x86_64-pc-linux" string Target platform, similar to the clang -target flag. The general format is <arch><sub>-<vendor>-<sys>-<abi>. http://clang.llvm.org/docs/CrossCompilation.html#target-triple.
--target_cpu="" string Target cpu, similar to the clang -mcpu flag. http://clang.llvm.org/docs/CrossCompilation.html#cpu-fpu-abi
--target_features="" string Target features, e.g. +avx2, +neon, etc.
--entry_point="entry" string Name of the generated function. If multiple generated object files will be linked into the same binary, each will need a unique entry point.
--cpp_class="" string Name of the generated C++ class, wrapping the generated function. The syntax of this flag is [[<optional_namespace>::],...]<class_name>. This mirrors the C++ syntax for referring to a class, where multiple namespaces may precede the class name, separated by double-colons. The class will be generated in the given namespace(s), or if no namespaces are given, within the global namespace.
--out_function_object="out_model.o" string Output object file containing the generated function for the TensorFlow model.
--out_header="out.h" string Output header file name.
--out_metadata_object="out_helper.o" string Output object file name containing optional metadata for the generated function.
--out_session_module="" string Output session module proto.
--gen_name_to_index=false bool Generate name-to-index data for Lookup{Arg,Result}Index methods.
--gen_program_shape=false bool Generate program shape data for the ProgramShape method.
--xla_generate_hlo_graph="" string HLO modules matching this regex will be dumped to a .dot file throughout various stages in compilation.
--xla_hlo_graph_addresses=false bool With xla_generate_hlo_graph, show addresses of HLO ops in graph dump.
--xla_hlo_graph_path="" string With xla_generate_hlo_graph, dump the graphs into this path.
--xla_hlo_dump_as_graphdef=false bool Dump HLO graphs as TensorFlow GraphDefs.
--xla_hlo_graph_sharding_color=false bool Assign colors based on sharding assignments when generating the HLO graphs.
--xla_hlo_tfgraph_device_scopes=false bool When generating TensorFlow HLO graphs, if the HLO instructions are assigned to a specific device, prefix the name scope with "devX" with X being the device ordinal.
--xla_log_hlo_text="" string HLO modules matching this regex will be dumped to LOG(INFO).
--xla_generate_hlo_text_to="" string Dump all HLO modules as text into the provided directory path.
--xla_enable_fast_math=true bool Enable unsafe fast-math optimizations in the compiler; this may produce faster code at the expense of some accuracy.
--xla_llvm_enable_alias_scope_metadata=true bool In LLVM-based backends, enable the emission of !alias.scope metadata in the generated IR.
--xla_llvm_enable_noalias_metadata=true bool In LLVM-based backends, enable the emission of !noalias metadata in the generated IR.
--xla_llvm_enable_invariant_load_metadata=true bool In LLVM-based backends, enable the emission of !invariant.load metadata in the generated IR.
--xla_llvm_disable_expensive_passes=false bool In LLVM-based backends, disable a custom set of expensive optimization passes.
--xla_backend_optimization_level=3 int32 Numerical optimization level for the XLA compiler backend.
--xla_disable_hlo_passes="" string Comma-separated list of hlo passes to be disabled. These names must exactly match the passes' names; no whitespace around commas.
--xla_embed_ir_in_executable=false bool Embed the compiler IR as a string in the executable.
--xla_dump_ir_to="" string Dump the compiler IR into this directory as individual files.
--xla_eliminate_hlo_implicit_broadcast=true bool Eliminate implicit broadcasts when lowering user computations to HLO instructions; use explicit broadcast instead.
--xla_cpu_multi_thread_eigen=true bool When generating calls to Eigen in the CPU backend, use multi-threaded Eigen mode.
--xla_gpu_cuda_data_dir="./cuda_sdk_lib" string If non-empty, speficies a local directory containing ptxas and nvvm libdevice files; otherwise we use those from runfile directories.
--xla_gpu_ftz=false bool If true, flush-to-zero semantics are enabled in the code generated for GPUs.
--xla_gpu_disable_multi_streaming=false bool If true, multi-streaming in the GPU backend is disabled.
--xla_gpu_max_kernel_unroll_factor=4 int32 Specify the maximum kernel unroll factor for the GPU backend.
--xla_dump_optimized_hlo_proto_to="" string Dump Hlo after all hlo passes are executed as proto binary into this directory.
--xla_dump_unoptimized_hlo_proto_to="" string Dump HLO before any hlo passes are executed as proto binary into this directory.
--xla_dump_per_pass_hlo_proto_to="" string Dump HLO after each pass as an HloProto in binary file format into this directory.
--xla_test_all_output_layouts=false bool Let ClientLibraryTestBase::ComputeAndCompare* test all permutations of output layouts. For example, with a 3D shape, all permutations of the set {0, 1, 2} are tried.
--xla_test_all_input_layouts=false bool Let ClientLibraryTestBase::ComputeAndCompare* test all permutations of *input* layouts. For example, for 2 input arguments with 2D shape and 4D shape, the computation will run 2! * 4! times for every possible layouts
--xla_hlo_profile=false bool Instrument the computation to collect per-HLO cycle counts
--xla_dump_computations_to="" string Dump computations that XLA executes into the provided directory path
--xla_dump_executions_to="" string Dump parameters and results of computations that XLA executes into the provided directory path
--xla_backend_extra_options="" string Extra options to pass to a backend; comma-separated list of 'key=val' strings (=val may be omitted); no whitespace around commas.
--xla_reduce_precision="" string Directions for adding reduce-precision operations. Format is 'LOCATION=E,M:OPS;NAMES' where LOCATION is the class of locations in which to insert the operations (e.g., 'OP_OUTPUTS'), E and M are the exponent and matissa bit counts respectively, and OPS and NAMES are comma-separated (no spaces) lists of the operation types and names to which to attach the reduce-precision operations. The NAMES string and its preceding ';' may be omitted. This option may be repeated to define multiple sets of added reduce-precision operations.
--xla_gpu_use_cudnn_batchnorm=false bool Allows the GPU backend to implement batchnorm HLOs using cudnn, rather than expanding them to a soup of HLOs.
--xla_cpu_use_mkl_dnn=false bool Generate calls to MKL-DNN in the CPU backend.
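
If you go the direct-invocation route, a small wrapper script (in the spirit of the tfcompile_h_o.py file in the directory listing above) might look like the sketch below. The tfcompile path and the commented-out Android target triple are assumptions; the output file names match the .h/.o artifacts that appear under //tensorflow/compiler/aot/custom.

    import subprocess

    # Assumed location of the tfcompile binary built in Step 0.
    TFCOMPILE = "bazel-bin/tensorflow/compiler/aot/tfcompile"

    cmd = [
        TFCOMPILE,
        "--graph=frozen_custom_010.pb",
        "--config=graph.config.pbtxt",
        "--cpp_class=Classifier",
        "--out_header=custom_interface_lib.h",
        "--out_function_object=custom_interface_tfcompile_function.o",
        "--out_metadata_object=custom_interface_tfcompile_metadata.o",
        # "--target_triple=armv7-none-linux-androideabi",  # assumed value; see the
        #                                                  # --target_triple flag above
    ]
    subprocess.run(cmd, check=True)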

In the end three artifacts are generated:

(Screenshots: the generated artifacts)
  • cat custom_interface_lib.h

    // Generated by tfcompile, the TensorFlow graph compiler. DO NOT EDIT!
    //
    // This header was generated via ahead-of-time compilation of a TensorFlow
    // graph. An object file corresponding to this header was also generated.
    // This header gives access to the functionality in that object file.
    //
    // clang-format off
    #ifndef TFCOMPILE_GENERATED___xla_tensorflow_compiler_aot_custom__custom_interface_H_ // NOLINT(build/header_guard)
    #define TFCOMPILE_GENERATED___xla_tensorflow_compiler_aot_custom__custom_interface_H_ // NOLINT(build/header_guard)
    #include "tensorflow/compiler/tf2xla/xla_compiled_cpu_function.h"
    #include "tensorflow/core/platform/types.h"
    namespace Eigen { struct ThreadPoolDevice; }
    namespace xla { class ExecutableRunOptions; }
    // (Implementation detail) Entry point to the function in the object file.
    extern "C" void __xla_tensorflow_compiler_aot_custom__custom_interface(
    void* result, const ::xla::ExecutableRunOptions* run_options,
    const void** args, void** temps, tensorflow::int64* profile_counters);
    // Classifier represents a computation previously specified in a
    // TensorFlow graph, now compiled into executable code. This extends the generic
    // XlaCompiledCpuFunction class with statically type-safe arg and result
    // methods. Usage example:
    //
    // Classifier computation;
    // // ...set args using computation.argN methods
    // CHECK(computation.Run());
    // // ...inspect results using computation.resultN methods
    //
    // The Run method invokes the actual computation, with inputs read from arg
    // buffers, and outputs written to result buffers. Each Run call may also use
    // a set of temporary buffers for the computation.
    //
    // By default each instance of this class manages its own arg, result and temp
    // buffers. The AllocMode constructor parameter may be used to modify the
    // buffer allocation strategy.
    //
    // Under the default allocation strategy, this class is thread-compatible:
    // o Calls to non-const methods require exclusive access to the object.
    // o Concurrent calls to const methods are OK, if those calls are made while it
    // is guaranteed that no thread may call a non-const method.
    //
    // The logical function signature is:
    // (arg0: f32[1,160,160,3]) -> (f32[1,30])
    //
    // Memory stats:
    // arg bytes total: 307200
    // arg bytes aligned: 307200
    // temp bytes total: 4143008
    // temp bytes aligned: 4143104
    class Classifier final : public tensorflow::XlaCompiledCpuFunction {
    public:
    // Number of input arguments for the compiled computation.
    static constexpr size_t kNumArgs = 1;
    // Byte size of each argument buffer. There are kNumArgs entries.
    static const ::tensorflow::int64 ArgSize(::tensorflow::int32 index) {
    return BufferInfos()[ArgIndexToBufferIndex()[index]].size();
    }
    // Returns static data used to create an XlaCompiledCpuFunction.
    static const tensorflow::XlaCompiledCpuFunction::StaticData& StaticData() {
    static XlaCompiledCpuFunction::StaticData* kStaticData = [](){
    XlaCompiledCpuFunction::StaticData* data =
    new XlaCompiledCpuFunction::StaticData;
    set_static_data_raw_function(data, __xla_tensorflow_compiler_aot_custom__custom_interface);
    set_static_data_buffer_infos(data, BufferInfos());
    set_static_data_num_buffers(data, kNumBuffers);
    set_static_data_arg_index_table(data, ArgIndexToBufferIndex());
    set_static_data_num_args(data, kNumArgs);
    set_static_data_result_index(data, kResultIndex);
    set_static_data_arg_names(data, StaticArgNames());
    set_static_data_result_names(data, StaticResultNames());
    set_static_data_program_shape(data, StaticProgramShape());
    set_static_data_hlo_profile_printer_data(
    data, StaticHloProfilePrinterData());
    return data;
    }();
    return *kStaticData;
    }
    Classifier(AllocMode alloc_mode =
    AllocMode::ARGS_VARIABLES_RESULTS_PROFILES_AND_TEMPS)
    : XlaCompiledCpuFunction(StaticData(), alloc_mode) {}
    Classifier(const Classifier&) = delete;
    Classifier& operator=(const Classifier&) = delete;
    // Arg methods for managing input buffers. Buffers are in row-major order.
    // There is a set of methods for each positional argument, with the following
    // general form:
    //
    // void set_argN_data(void* data)
    // Sets the buffer of type T for positional argument N. May be called in
    // any AllocMode. Must be called before Run to have an affect. Must be
    // called in AllocMode::RESULTS_PROFILES_AND_TEMPS_ONLY for each positional
    // argument, to set the argument buffers.
    //
    // T* argN_data()
    // Returns the buffer of type T for positional argument N.
    //
    // T& argN(...dim indices...)
    // Returns a reference to the value of type T for positional argument N,
    // with dim indices specifying which value. No bounds checking is performed
    // on dim indices.
    void set_arg0_data(const void* data) {
    set_arg_data(0, data);
    }
    float* arg0_data() {
    return static_cast<float*>(arg_data(0));
    }
    float& arg0(size_t dim0, size_t dim1, size_t dim2, size_t dim3) {
    return (*static_cast<float(*)[1][160][160][3]>(
    arg_data(0)))[dim0][dim1][dim2][dim3];
    }
    const float* arg0_data() const {
    return static_cast<const float*>(arg_data(0));
    }
    const float& arg0(size_t dim0, size_t dim1, size_t dim2, size_t dim3) const {
    return (*static_cast<const float(*)[1][160][160][3]>(
    arg_data(0)))[dim0][dim1][dim2][dim3];
    }
    // Result methods for managing output buffers. Buffers are in row-major order.
    // Must only be called after a successful Run call. There is a set of methods
    // for each positional result, with the following general form:
    //
    // T* resultN_data()
    // Returns the buffer of type T for positional result N.
    //
    // T& resultN(...dim indices...)
    // Returns a reference to the value of type T for positional result N,
    // with dim indices specifying which value. No bounds checking is performed
    // on dim indices.
    //
    // Unlike the arg methods, there is no set_resultN_data method. The result
    // buffers are managed internally, and may change after each call to Run.
    float* result0_data() {
    return static_cast<float*>(result_data(0));
    }
    float& result0(size_t dim0, size_t dim1) {
    return (*static_cast<float(*)[1][30]>(
    result_data(0)))[dim0][dim1];
    }
    const float* result0_data() const {
    return static_cast<const float*>(result_data(0));
    }
    const float& result0(size_t dim0, size_t dim1) const {
    return (*static_cast<const float(*)[1][30]>(
    result_data(0)))[dim0][dim1];
    }
    // Methods for managing variable buffers. Buffers are in row-major order.
    //
    // For read-write variables we generate the following methods:
    //
    // void set_var_X_data(T* data)
    // Sets the buffer for variable X. Must be called before Run if the
    // allocation mode is RESULTS_PROFILES_AND_TEMPS_ONLY.
    //
    // T* var_X_data()
    // Returns the buffer of type T for variable X. If the allocation mode is
    // RESULTS_PROFILES_AND_TEMPS_ONLY then this buffer is the same as the
    // buffer passed to set_var_X_data.
    //
    // T& var_X(...dim indices...)
    // Returns a reference to the value of type T for variable X,
    // with dim indices specifying which value. No bounds checking is performed
    // on dim indices.
    //
    // For readonly variables we generate the same set of methods, except that we
    // use `const T` instead of `T`. We use `const T` to avoid erasing the
    // constness of the buffer passed to `set_var_X_data` but the underlying
    // buffer is not const (and thus the const can be safely const-cast'ed away)
    // unless `set_var_X_data` is called with a pointer to constant storage.
    private:
    // Number of buffers for the compiled computation.
    static constexpr size_t kNumBuffers = 50;
    static const ::xla::cpu_function_runtime::BufferInfo* BufferInfos() {
    static const ::xla::cpu_function_runtime::BufferInfo
    kBufferInfos[kNumBuffers] = {
    ::xla::cpu_function_runtime::BufferInfo({2293760ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({1228802ULL, 0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({614400ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({602112ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({301056ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({301056ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({301056ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({301056ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({301056ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({172032ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({98304ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({98304ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({98304ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({98304ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({98304ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({73728ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({55296ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({55296ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({55296ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({55296ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({55296ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({55296ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({55296ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({36864ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({24576ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({24576ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({24576ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({24576ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({24576ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({12288ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({6912ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({6144ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({6144ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({6144ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({6144ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({6144ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({2048ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({481ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({33ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({16ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({19ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({19ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({19ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({19ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({19ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({19ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({19ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({19ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({19ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({16571521ULL, ~0ULL})
    };
    return kBufferInfos;
    }
    static const ::tensorflow::int32* ArgIndexToBufferIndex() {
    static constexpr ::tensorflow::int32 kArgIndexToBufferIndex[kNumArgs] = {
    1
    };
    return kArgIndexToBufferIndex;
    }
    // The 0-based index of the result tuple in the temporary buffers.
    static constexpr size_t kResultIndex = 38;
    // Array of names of each positional argument, terminated by nullptr.
    static const char** StaticArgNames() {
    return nullptr;
    }
    // Array of names of each positional result, terminated by nullptr.
    static const char** StaticResultNames() {
    return nullptr;
    }
    // Shape of the args and results.
    static const ::xla::ProgramShapeProto* StaticProgramShape() {
    static const ::xla::ProgramShapeProto* kShape = nullptr;
    return kShape;
    }
    // Metadata that can be used to pretty-print profile counters.
    static const ::xla::HloProfilePrinterData* StaticHloProfilePrinterData() {
    static const ::xla::HloProfilePrinterData* kHloProfilePrinterData =
    nullptr;
    return kHloProfilePrinterData;
    }
    };
    #endif // TFCOMPILE_GENERATED___xla_tensorflow_compiler_aot_custom__custom_interface_H_
    // clang-format on

You can see that the concrete graph has been converted into the corresponding runtime instructions.

Each BufferInfo entry marks a concrete runtime buffer block in the .o file. The memory stats in the header also check out: for example, the arg bytes total of 307200 is exactly 1 x 160 x 160 x 3 floats at 4 bytes each.

Step 5: Write code that calls the AOT model

In this step you need to think about which platform the final .so will run on. Since we use it on mobile (Android), talking to the C++ side requires JNI support, so at this point the TensorFlow source has to be re-configured (./configure) with Android SDK/NDK support enabled; choose the SDK/NDK target versions yourself.

The C++ calling code below is written to conform to the JNI conventions.

In the Java project, declare the corresponding native method and generate the matching JNI header:

package com.qihoo.cleandroid.sdk.imageclassfier.core.classfier.process;

/**
 * Created by zhanghongxin on 2019/7/22.
 */
public class CustomClassifier {

    private static final String LIBNAME = "custom_interface";

    private CustomClassifier() {}

    /**
     * Load the native library containing the AOT-compiled model.
     */
    static boolean init() {
        try {
            System.loadLibrary(LIBNAME);
            return true;
        } catch (UnsatisfiedLinkError e) {
            System.err.println("custom_interface: failed to load native library: " + e.getMessage());
            return false;
        }
    }

    static {
        init();
    }

    public static native void getPredictResult(float[][][][] input, float[][] output, int inputSize, int outputSize);
}

This generates the header com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier.h:

/* DO NOT EDIT THIS FILE - it is machine generated */
#include <jni.h>
/* Header for class com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier */

#ifndef _Included_com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier
#define _Included_com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier
#ifdef __cplusplus
extern "C" {
#endif
/*
 * Class:     com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier
 * Method:    getPredictResult
 * Signature: ([[[[F[[FII)V
 */
JNIEXPORT void JNICALL Java_com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier_getPredictResult
  (JNIEnv *, jclass, jobjectArray, jobjectArray, jint, jint);

#ifdef __cplusplus
}
#endif
#endif

Next, write the C++ implementation that actually calls the model, com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier.cc:

#define EIGEN_USE_THREADS
#define EIGEN_USE_CUSTOM_THREAD_POOL

#include <iostream>
#include <cstdio>
#include <jni.h>
#include <android/log.h>

#include "custom_interface_lib.h"
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"

float* run(float* input, float* output, int input_size, int output_size) {
    std::cout << "Load .so SUCCESS" << std::endl;
    Eigen::ThreadPool tp(std::thread::hardware_concurrency());
    Eigen::ThreadPoolDevice device(&tp, tp.NumThreads());
    Classifier classifier;
    classifier.set_thread_pool(&device);
    // Copy the caller's input buffer into the model's arg0 buffer and run.
    std::copy(input, input + input_size, classifier.arg0_data());
    auto ok = classifier.Run();
    if (not ok) std::cout << "NOT OK" << std::endl;
    //
    // std::cout << "input:";
    // std::cout << input << std::endl;
    //
    // std::cout << "input_size:";
    // std::cout << input_size << std::endl;
    //
    // std::cout << "classifier.arg0_data():";
    // std::cout << classifier.arg0_data() << std::endl;
    //
    // std::cout << "output:";
    // std::cout << output << std::endl;
    //
    // std::cout << "output_size:";
    // std::cout << output_size << std::endl;
    //
    // std::cout << "result0_data():";
    // std::cout << classifier.result0_data() << std::endl;
    //
    // for (int i = 0; i < 30; i++) {
    //     std::cout << "restul0_";
    //     std::cout << i;
    //     std::cout << " : ";
    //     std::cout << classifier.result0(0, i) << std::endl;
    //     __android_log_print(ANDROID_LOG_INFO, "NATIVE", "~~~~~~~~~OUTPUT== %f~~~~~~~~~~~~~~~\n", classifier.result0(0, i));
    // }
    // Copy the model's result buffer back into the caller's output buffer.
    std::copy(classifier.result0_data(), classifier.result0_data() + output_size, output);
    return output;
}

#ifndef _Included_com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier
#define _Included_com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier
#ifdef __cplusplus
extern "C" {
#endif

JNIEXPORT void JNICALL Java_com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier_getPredictResult
  (JNIEnv *env, jobject obj, jobjectArray inputArray, jobjectArray outputArray, jint inputSize, jint outputSize) {
    jboolean isCopy = JNI_FALSE;
    jint rows = env->GetArrayLength(inputArray);
    // __android_log_print(ANDROID_LOG_INFO, "NATIVE", "~~~~~~~~~inputRows== %d~~~~~~~~~~~~~~~\n", rows);
    // Unwrap the float[1][160][160][3] input down to the innermost float[].
    jobjectArray tempInputArray = (jobjectArray) env->GetObjectArrayElement(inputArray, 0);
    jobjectArray tTempInputArray = (jobjectArray) env->GetObjectArrayElement(tempInputArray, 0);
    jobjectArray tTTempInputArray = (jobjectArray) env->GetObjectArrayElement(tTempInputArray, 0);
    jfloat* input = env->GetFloatArrayElements((jfloatArray) tTTempInputArray, 0);
    // Unwrap the float[1][30] output buffer.
    jobjectArray tempOutputArray = (jobjectArray) env->GetObjectArrayElement(outputArray, 0);
    jfloat* output = env->GetFloatArrayElements((jfloatArray) tempOutputArray, 0);

    jfloat* resultPointer = run(input, output, inputSize, outputSize);

    // jclass floatArrayClz = env->FindClass("[[F");
    // if (floatArrayClz == NULL) return NULL;
    // outputArray = env->NewObjectArray(outputSize, floatArrayClz, NULL);
    // if (outputArray == NULL) return NULL;
    // Copy the results back into the Java output array.
    for (int i = 0; i < 1; i++) {
        jfloat temp[outputSize];
        jfloatArray floatArray = env->NewFloatArray(outputSize);
        for (int j = 0; j < outputSize; j++) {
            temp[j] = *(resultPointer + j);
        }
        env->SetFloatArrayRegion(floatArray, 0, outputSize, temp);
        env->SetObjectArrayElement(outputArray, i, floatArray);
        env->DeleteLocalRef(floatArray);
    }
}

#ifdef __cplusplus
}
#endif
#endif

This step is complete.

Step 6: Write the BUILD file

The BUILD file edited in this step is the same file created in Step 3.

cc_library(
    name = "library",
    hdrs = ["custom_interface_lib.h"],
    srcs = [
        "custom_interface_tfcompile_function.o",
        "custom_interface_tfcompile_metadata.o",
    ],
)

cc_binary(
    name = "libcustom_interface.so",
    srcs = [
        "com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier.h",
        "com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier.cc",
    ],
    deps = [
        ":library",
        "//tensorflow/compiler/tf2xla:xla_compiled_cpu_function",
        "//tensorflow/core:framework_lite",
        "//tensorflow/compiler/xla:cpu_function_runtime",
        "//tensorflow/compiler/xla/service/cpu:runtime_conv2d",
        "//tensorflow/compiler/xla/service/cpu:runtime_matmul",
        "//third_party/eigen3",
    ],
    linkopts = [
        "-landroid",
        "-shared",
    ],
    linkshared = 1,
    linkstatic = 1,
    copts = ["-fPIC"],
)

The name in cc_binary is the name of the generated .so, and it should follow the JNI naming convention (System.loadLibrary("custom_interface") expects a library named libcustom_interface.so).

Note that to produce a .so shared library you must set linkshared = 1 and linkstatic = 1.

Step 7: Compile the final platform-specific artifact (.so)

Run the command:

bazel build -c opt //tensorflow/compiler/aot/custom:libcustom_interface.so \
    --crosstool_top=//external:android/crosstool \
    --host_crosstool_top=@bazel_tools//tools/cpp:toolchain \
    --cpu=armeabi-v7a

Use --cpu=xxx to control which platform ABI the .so is built for (e.g. armeabi-v7a or arm64-v8a, matching the ABI directories in the layout above).

With that, all the build steps are complete.

Experimental results

Comparison after wiring XLA acceleration into the mobile app (strictly speaking it is not a fair comparison, because the variables were not controlled):

(Screenshots: time taken by the Lite build vs. the XLA/AOT build)

The screenshots above compare the time it took to run predictions on the same 174 images, on the same device model, with the Lite build and with the XLA/AOT build.

Lite clearly has a big speed advantage here. That does not mean XLA does nothing; it just happens that the convolutional network used in our graph is not a good fit for XLA's optimizations in this scenario. The official speed comparison uses JIT, and the JIT figures measured at training time do correspond to the AOT figures seen at run time, which happens to reflect the same relationship.

A few official demo figures are attached below:


Reference